{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 读入数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def loadDataSet(filename):\n",
    "    frTrain = open(filename)\n",
    "    featureSet = []\n",
    "    labels = []\n",
    "    for line in frTrain.readlines():\n",
    "        currLine = line.strip().split('\\t')\n",
    "        lineArr = []\n",
    "        for i in range(21):\n",
    "            lineArr.append(float(currLine[i]))\n",
    "        featureSet.append(lineArr)\n",
    "        labels.append(float(currLine[21]))\n",
    "    return featureSet, labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sigmoid 函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sigmoid(inX):\n",
    "    return 1.0/(1+np.exp(-inX))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 梯度上升优化算法\n",
    "**梯度上升伪代码：**  \n",
    "每个回归系数初始化为1  \n",
    "重复R次：  \n",
    "&ensp;&ensp;&ensp;&ensp;计算整个数据集的梯度  \n",
    "&ensp;&ensp;&ensp;&ensp;使用 alpha\\*gradient 更新回归系数的向量  \n",
    "返回回归系数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gradAscent(dataMatIn, classLabels):\n",
    "    # 将数据转换成 numpy 矩阵\n",
    "    dataMatrix = np.mat(dataMatIn)    # 100*3\n",
    "    labelMat = np.mat(classLabels).transpose()    # 1*100 --> 100*1\n",
    "    m, n = np.shape(dataMatrix)\n",
    "    alpha = 0.001    # 步长\n",
    "    maxCycles = 500    # 迭代次数\n",
    "    weights = np.ones((n, 1))    # n*1\n",
    "    for k in range(maxCycles):\n",
    "        h = sigmoid(dataMatrix * weights)    # (100*3)*(3*1) --> 100*1\n",
    "        error = (labelMat - h)    # 100*1\n",
    "        weights = weights + alpha * dataMatrix.transpose() * error\n",
    "    return weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 随机梯度上升算法\n",
    "随机梯度上升算法：一次仅用一个样本点来更新回归系数  \n",
    "**随机梯度上升伪代码：**  \n",
    "每个回归系数初始化为1  \n",
    "对数据集中每个样本：  \n",
    "&ensp;&ensp;&ensp;&ensp;计算该样本的梯度  \n",
    "&ensp;&ensp;&ensp;&ensp;使用 alpha\\*gradient 更新回归系数的向量  \n",
    "返回回归系数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def stocGradAscent0(dataMatrix, classLabels):\n",
    "    m, n = np.shape(dataMatrix)\n",
    "    alpha = 0.01\n",
    "    weights = np.ones(n)    # 1*n\n",
    "    for i in range(m):\n",
    "        h = sigmoid(sum(dataMatrix[i]*weights))\n",
    "        error = classLabels[i] - h\n",
    "        weights = weights + alpha * error * dataMatrix[i]\n",
    "    return weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 改进的随机梯度上升算法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def stocGradAscent1(dataMatrix, classLabels, numIter=150):\n",
    "    m, n = np.shape(dataMatrix)\n",
    "    weights = np.ones(n)    # 1*n\n",
    "    for j in range(numIter):\n",
    "        dataIndex = list(range(m))\n",
    "        for i in range(m):\n",
    "            alpha = 4 / (1.0 + j + i) + 0.05    # 每次迭代调整alpha\n",
    "            randIndex = int(np.random.uniform(0, len(dataIndex)))    # 随机选取样本   \n",
    "            h = sigmoid(sum(dataMatrix[randIndex]*weights))\n",
    "            error = classLabels[randIndex] - h\n",
    "            weights = weights + alpha * error * dataMatrix[randIndex]\n",
    "            del(dataIndex[randIndex])\n",
    "    return weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备数据\n",
    "处理特征缺失值常用方法\n",
    "- 使用可用特征的均值来补缺失值\n",
    "- 使用特殊值填补，如-1\n",
    "- 忽略有缺失值的样本\n",
    "- 使用相似样本的均值来补缺失值\n",
    "- 使用另外的机器学习算法预测缺失值  \n",
    "处理类别标签缺失值方法： 丢弃该条样本  \n",
    "  \n",
    "本节中根据逻辑回归的特点，将缺失值用0替代，既可保留数据，也不影响优化算法\n",
    "此节使用已处理好的数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logistic回归分类函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def classifyVector(inX, weights):\n",
    "    prob = sigmoid(sum(inX * weights))\n",
    "    if prob > 0.5:\n",
    "        return 1.0\n",
    "    else:\n",
    "        return 0.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def colicTest():\n",
    "    # 读入训练数据\n",
    "    trainingSet,trainingLabels = loadDataSet('horseColicTraining.txt')\n",
    "    # 使用随机梯度下降进行系数学习\n",
    "    trainWeights = stocGradAscent1(np.array(trainingSet), trainingLabels, 666)\n",
    "    errorCount = 0\n",
    "    # 读入测试数据\n",
    "    testSet,testLabels = loadDataSet('horseColicTest.txt')\n",
    "    for i in range(len(testLabels)):\n",
    "        if classifyVector(testSet[i], trainWeights) != testLabels[i]:\n",
    "            errorCount += 1\n",
    "    # 计算错误率\n",
    "    errorRate = float(errorCount)/float(len(testLabels))\n",
    "    print('the error rate of this test is: %f' % errorRate)\n",
    "    return errorRate\n",
    "\n",
    "def multiTest():\n",
    "    numTests = 10\n",
    "    errorSum = 0.0\n",
    "    for k in range(numTests):\n",
    "        errorSum += colicTest()\n",
    "    print('after %d iterations the average error rate is: %f' % (numTests, errorSum/float(numTests)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the error rate of this test is: 0.343284\n",
      "the error rate of this test is: 0.313433\n",
      "the error rate of this test is: 0.328358\n",
      "the error rate of this test is: 0.268657\n",
      "the error rate of this test is: 0.388060\n",
      "the error rate of this test is: 0.343284\n",
      "the error rate of this test is: 0.447761\n",
      "the error rate of this test is: 0.447761\n",
      "the error rate of this test is: 0.447761\n",
      "the error rate of this test is: 0.358209\n",
      "after 10 iterations the average error rate is: 0.368657\n"
     ]
    }
   ],
   "source": [
    "multiTest()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
